Conversation

@xxmplus (Contributor) commented Oct 1, 2025

Log the full path to the WAL file in the disk-slow info so it's explicit whether the WAL file is on the primary or the secondary.

Enhance the log writer to instrument, in the latency histogram, all filesystem operations performed while writing the log file (create, write, sync, etc.).
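
For illustration only, a minimal sketch of the instrumentation pattern; the type and field names below are hypothetical and are not the code in this PR:

package walsketch

import (
	"os"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// instrumentedWALFile wraps a WAL file and records the latency of every
// filesystem operation against it in a single histogram.
type instrumentedWALFile struct {
	f    *os.File
	hist prometheus.Histogram // observed in nanoseconds in this sketch
}

func (w *instrumentedWALFile) Write(p []byte) (int, error) {
	start := time.Now()
	n, err := w.f.Write(p)
	w.hist.Observe(float64(time.Since(start).Nanoseconds()))
	return n, err
}

func (w *instrumentedWALFile) Sync() error {
	start := time.Now()
	err := w.f.Sync()
	w.hist.Observe(float64(time.Since(start).Nanoseconds()))
	return err
}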

Fixes #5328

@xxmplus xxmplus requested a review from a team as a code owner October 1, 2025 03:28
@xxmplus xxmplus requested a review from annrpom October 1, 2025 03:28
@cockroach-teamcity (Member)

This change is Reviewable

@xxmplus xxmplus force-pushed the i5328-improve-metrics branch 2 times, most recently from e1d6b17 to 4c5809a Compare October 1, 2025 03:35
@RaduBerinde (Member) left a comment

Reviewable status: 0 of 19 files reviewed, 2 unresolved discussions (waiting on @annrpom)


record/log_writer_latency_test.go line 17 at r2 (raw file):

)

func TestWALMetricsWithErrorfsLatency(t *testing.T) {

Is this test reliable? In general, relying on timings leads to flakes. Can you run it with --exec 'stress -p 64' and with --exec 'stress -p 64' --race?

More generally, this is a LOT of testing code. ~1500 lines of code for a relatively simple addition - these tests will need to be maintained, sometimes debugged. I don't think it's worth it. Some things are hard to add unit tests for and that's ok, especially when it's an observability thing that has no bearing on correctness.


record/log_writer.go line 837 at r2 (raw file):

		writeStart := crtime.NowMono()
		_, err = w.w.Write(data)
		if writeLatency := writeStart.Elapsed(); w.flusher.walFileMetrics.WriteLatency != nil {

[nit] the assignment here is strange (it's not used in the condition). I'd just inline writeStart.Elapsed().
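
i.e., something like this, assuming the elided if-body observes the write latency in nanoseconds (that part isn't shown in the snippet above):

		writeStart := crtime.NowMono()
		_, err = w.w.Write(data)
		if w.flusher.walFileMetrics.WriteLatency != nil {
			w.flusher.walFileMetrics.WriteLatency.Observe(float64(writeStart.Elapsed()))
		}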

@jbowens (Collaborator) left a comment

I haven't reviewed the test code yet, but in a cursory glance I found the sheer volume a little daunting. We try to hold our test code to a similar standard of simplicity as production code, often uncoupling test data/cases from test code that performs the mechanics. I'm wondering if we can test these changes with less code, perhaps leaning on either datadriven tests (uncoupling data and test code) or randomized testing (which tends to achieve higher coverage with less test code to maintain).
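
For example, a datadriven test could look roughly like this (a sketch only; the command name and testdata path are hypothetical):

package record

import (
	"testing"

	"github.com/cockroachdb/datadriven"
)

// Hypothetical shape of a datadriven test for the WAL file metrics: the
// expected output lives in the testdata file instead of Go assertions.
func TestWALFileMetricsDataDriven(t *testing.T) {
	datadriven.RunTest(t, "testdata/wal_file_metrics",
		func(t *testing.T, td *datadriven.TestData) string {
			switch td.Cmd {
			case "write":
				// A real implementation would feed td.Input to a LogWriter
				// backed by an in-memory FS and print the resulting
				// histogram counts.
				return "sample-count: 0\n"
			default:
				td.Fatalf(t, "unknown command %q", td.Cmd)
				return ""
			}
		})
}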

Reviewable status: 0 of 19 files reviewed, 3 unresolved discussions (waiting on @annrpom and @xxmplus)


metrics.go line 568 at r2 (raw file):

		CloseLatency   prometheus.Histogram // File close operations
		StatLatency    prometheus.Histogram // File stat operations
		OpenDirLatency prometheus.Histogram // Directory open operations

I don't think we need per-operation histograms, but rather one histogram for the primary and one histogram for the secondary, reporting every operation against the individual file.
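
Roughly (a sketch; names hypothetical):

package walsketch

import "github.com/prometheus/client_golang/prometheus"

// One histogram per WAL directory rather than one per operation type:
// every create/write/sync/close against a WAL file observes the histogram
// of the directory that file lives on.
type WALFileOpMetrics struct {
	Primary   prometheus.Histogram
	Secondary prometheus.Histogram
}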

@xxmplus (Contributor, Author) left a comment

will do.

Reviewable status: 0 of 19 files reviewed, 3 unresolved discussions (waiting on @annrpom, @jbowens, and @RaduBerinde)


metrics.go line 568 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I don't think we need per-operation histograms, but rather one histogram for the primary and one histogram for the secondary, reporting every operation against the individual file.

These ops have quite different latency and access-frequency patterns. If we consolidate them into one, we will lose visibility into the less frequent ops. Is that still OK?


record/log_writer_latency_test.go line 17 at r2 (raw file):

Previously, RaduBerinde wrote…

Is this test reliable? In general, relying on timings leads to flakes. Can you run it with --exec 'stress -p 64' and with --exec 'stress -p 64' --race?

More generally, this is a LOT of testing code. ~1500 lines of code for a relatively simple addition - these tests will need to be maintained, sometimes debugged. I don't think it's worth it. Some things are hard to add unit tests for and that's ok, especially when it's an observability thing that has no bearing on correctness.

I ran it under stress but not under race. OK, I will remove some of the more verbose tests.

@xxmplus (Contributor, Author) commented Oct 8, 2025

I don't think we need per-operation histograms, but rather one histogram for the primary and one histogram for the secondary, reporting every operation against the individual file.

These ops have quite different latency and access-frequency patterns. If we consolidate them into one, we will lose visibility into the less frequent ops. Is that still OK?

I had a chance to sync up with Jackson. While these ops have different latency profiles, we care more about their impact as a whole. I will update the code to consolidate them and avoid increasing the number of metrics.

@xxmplus xxmplus force-pushed the i5328-improve-metrics branch 3 times, most recently from 2350d6b to 2d250f8 Compare October 14, 2025 02:53
Log full path to WAL file in disk slow info so it's explicit if the
WAL file is on the primary or secondary.

Fixes cockroachdb#5328
@xxmplus xxmplus force-pushed the i5328-improve-metrics branch from 2d250f8 to 3d8fdf3 Compare October 22, 2025 19:45
@jbowens (Collaborator) left a comment

@jbowens reviewed 1 of 2 files at r1, 1 of 17 files at r4, all commit messages.
Reviewable status: 2 of 19 files reviewed, 7 unresolved discussions (waiting on @annrpom, @RaduBerinde, and @xxmplus)


open.go line 295 at r4 (raw file):

	walFileOpHistogram := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "pebble_wal_file_op_duration_nanoseconds",
		Help:    "Histogram of WAL file operation latencies in nanoseconds",

nit: I think we should just exclude the Name and Help sections; they'll be unused in the context of Cockroach and may confuse.


metrics.go line 578 at r4 (raw file):

		prometheus.LinearBuckets(0.0, 0.01, 50),                 // 0 to 0.5ms in 10μs increments
		prometheus.ExponentialBucketsRange(1.0, 10000.0, 50)..., // 1ms to 10s
	)

I think we can unify the histogram configuration, consolidating FsyncLatencyBuckets and WALFileOpLatencyBuckets
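
Something like this (a sketch reusing the bucket values from the snippet above; the variable and constructor names are hypothetical, and per the nit above, Name/Help are omitted):

package walsketch

import "github.com/prometheus/client_golang/prometheus"

// A single bucket layout shared by the fsync histogram and the WAL
// file-op histogram.
var WALLatencyBuckets = append(
	prometheus.LinearBuckets(0.0, 0.01, 50),                 // 0 to 0.5ms in 10μs increments
	prometheus.ExponentialBucketsRange(1.0, 10000.0, 50)..., // 1ms to 10s
)

func newWALLatencyHistogram() prometheus.Histogram {
	return prometheus.NewHistogram(prometheus.HistogramOpts{
		Buckets: WALLatencyBuckets,
	})
}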


db.go line 383 at r4 (raw file):

				// protected by mu.
				fsyncLatency prometheus.Histogram
				// Updated whenever a wal.Writer is closed.

nit: can we keep the comment that metrics.LogWriterMetrics is updated when a WAL writer is closed?


wal/failover_writer.go line 463 at r4 (raw file):

	// WAL file operation latency histogram
	walFileOpHistogram record.WALFileOpHistogram
	directoryType      record.DirectoryType

The failover writer handles writing to both the primary and the secondary. I think we need access to both histograms and the ability to construct child LogWriters that use one or the other depending on which device the child LogWriter is writing to.
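
Roughly (a sketch; names hypothetical): the failover writer would hold one histogram per directory and hand the matching one to each child LogWriter it constructs.

package walsketch

import "github.com/prometheus/client_golang/prometheus"

// One histogram per WAL directory; the failover writer picks the right
// one when it constructs a child LogWriter.
type failoverFileOpMetrics struct {
	primary   prometheus.Histogram
	secondary prometheus.Histogram
}

func (m *failoverFileOpMetrics) forDir(secondary bool) prometheus.Histogram {
	if secondary {
		return m.secondary
	}
	return m.primary
}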

@xxmplus xxmplus force-pushed the i5328-improve-metrics branch from 3d8fdf3 to 63a4485 Compare October 30, 2025 23:11
@xxmplus (Contributor, Author) left a comment

Reviewable status: 2 of 19 files reviewed, 7 unresolved discussions (waiting on @annrpom, @jbowens, and @RaduBerinde)


db.go line 383 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: can we keep the comment that metrics.LogWriterMetrics is updated when a WAL writer is closed?

ack. added to WALMetrics definition in metrics.go


metrics.go line 578 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I think we can unify the histogram configuration, consolidating FsyncLatencyBuckets and WALFileOpLatencyBuckets

ack. done.


open.go line 295 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: I think we should just exclude the Name and Help sections; they'll be unused in the context of Cockroach and may confuse.

removed.


wal/failover_writer.go line 463 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

The failover writer handles writing to both the primary and the secondary. I think we need access to both histograms and the ability to construct child LogWriters that use one or the other depending on which device the child LogWriter is writing to.

ack.

@xxmplus xxmplus force-pushed the i5328-improve-metrics branch from 63a4485 to 84b5370 Compare October 30, 2025 23:38
@xxmplus xxmplus requested a review from jbowens October 30, 2025 23:55
@github-actions (bot) commented Nov 5, 2025

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@RaduBerinde (Member) commented

Cost: $0.0210 | Duration: 8.7s

Claude just gave new meaning to "my two cents"

Enhance the log writer to instrument all filesystem operations done
for writing the log file in the latency histogram (create, write, sync, etc.).
@xxmplus xxmplus force-pushed the i5328-improve-metrics branch from 84b5370 to 4f01c47 Compare November 5, 2025 19:58
@xxmplus (Contributor, Author) commented Nov 5, 2025

I believe the AI review bot found a real bug. The PR has been updated accordingly.

Development

Successfully merging this pull request may close these issues:

db: improve metrics and logs with multiple disks being used by a DB

4 participants